Introduction

This project is a tutorial of using multivaraite time series analysis for the stock market index, NIFTY 50 from NSE (National Stock Exchange) India. The data is obtained from Nifty 50 contains price history and trading volumes of fifty stocks in India from 2000-01-03 to 2020-09-30.

We illustrates how to using Python, R, and Stata to apply Auto Regressive Integrated Moving Average (ARIMA) to time series data. ARIMA is able to fit a given non-seasonal non-stationary time series based on its lag values. A general ARIMA model consists of three parts: the “AR” part means the variable of interest is regressed on its lag terms, the “I” part means the differenced values are used, and the “MA” part means the regression error is modelled as a linear combination of error terms in the past. The purpose of using differenced terms is to make the time series stationary for autoregression.

A general ARIMA model consists of three parts: the “AR” part means the variable of interest is regressed on its lag terms, the “I” part means the differenced values are used, and the “MA” part means the regression error is modelled as a linear combination of error terms in the past. The purpose of using differenced terms is to make the time series stationary for autoregression.

An ARIMA model is characterized by 3 terms: p (the order of AR term), q (the order of the MA term), and d (number of differencing to make time series stationary). Given a time series \(\{X_t\}\), an \(ARIMA(p, d, q)\) model can be expressed as: \[(1-\sum_{i=1}^p\phi_iL^i)(1-L)^dX_t= (1+\sum_{i=1}^q\theta_iL^i)\epsilon_t + \delta\] where \(\epsilon_t\) is the error term, \(L\) is the lag operator, i.e. \(LX_t = X_{t-1}, \forall t>1\), \(p\) is the number of lagged terms of \(X\), \(d\) is the number of times of differencing needed for stationarity, \(q\) is the number of lagged forecast errors in prediction, \(\delta\) is the interception term for the regression, and \(\theta, \phi\)’s are the estimated regression coefficients.

ARIMA models are fitted in order to understand the data better and forecast future data. They are based on linear regression models. The best model can be chosen using AIC or BIC.

Data Description

NIFTY 50 data consist of 50 stocks, 230104 observations on 15 variables. The data contains daily open, close, highest and lowest prices, volume and other relevant information for the “Nifty Fifty” stocks since January 2000. Detailed variable descriptions are shown in Table 1 below.

Click to see variable descriptions.
Table 1. Variable descriptions of NIFTY 50 data
Variable Name Variable Description
Date Date of trade
Symbol Name of the company
Series We have only one series: Equity(EQ)
Prev Close Previous day’s close price
Open Open price of day
High Highest price in day
Low Lowest price in day
Last Last traded price in day
Close Close price of day
VWAP Volume Weighted Average Price
Volume A measure of sellers versus buyers of a particular stock
Turnover The number of shares available for sale
Trades The number of shares traded
Deliverable Volume Shares which are actually transferred among demat accounts
%Deliverable Percent of deliverble volume
Return Return of trade

As we are more interested in the stock prices, we use the variable VWAP for the most part. It can summarize the average price of the stock on a trading day. We want to catch the trend of stock prices across the years and possibly forecast future stock prices.

library(dplyr)
library(ggplot2)
old_sym = c('MUNDRAPORT', 'UTIBANK', 'BAJAUTOFIN', 'BHARTI', 'HEROHONDA',
       'HINDALC0', 'HINDLEVER', 'INFOSYSTCH', 'JSWSTL', 'KOTAKMAH', 'TELCO',
       'TISCO', 'UNIPHOS', 'SESAGOA', 'SSLT', 'ZEETELE')
new_sym = c('ADANIPORTS', 'AXISBANK', 'BAJFINANCE', 'BHARTIARTL', 'HEROMOTOCO',
       'HINDALCO', 'HINDUNILVR', 'INFY', 'JSWSTEEL', 'KOTAKBANK', 'TATAMOTORS',
       'TATASTEEL', 'UPL', 'VEDL', 'VEDL', 'ZEEL')
# summary statistics of variable of interest
nifty = nifty %>%
  mutate(Symbol = lapply(Symbol, 
                         function(x) replace(x, x %in% old_sym, new_sym)),
         Trades = ifelse(is.na(Trades), 0, Trades),
         Return = Close - `Prev Close`) %>% 
  select(Date, Symbol, Return, VWAP, Volume, Trades)
summary(nifty[, -2])
##       Date                Return               VWAP         
##  Min.   :2000-01-03   Min.   :-19525.95   Min.   :    9.21  
##  1st Qu.:2006-05-23   1st Qu.:    -5.80   1st Qu.:  272.43  
##  Median :2011-06-07   Median :     0.05   Median :  554.00  
##  Mean   :2011-02-20   Mean   :     0.25   Mean   : 1224.17  
##  3rd Qu.:2016-02-05   3rd Qu.:     6.40   3rd Qu.: 1213.47  
##  Max.   :2020-09-30   Max.   :  2107.70   Max.   :32975.24  
##      Volume              Trades       
##  Min.   :        3   Min.   :      0  
##  1st Qu.:   209105   1st Qu.:      0  
##  Median :   973757   Median :    235  
##  Mean   :  2745270   Mean   :  28762  
##  3rd Qu.:  2836929   3rd Qu.:  41776  
##  Max.   :481058927   Max.   :1643015
# plot trend of all stocks
fig.cap1 = "**Figure 1.** Daily trend of all stocks, 2000-2020."
nifty_ts = reshape2::melt(nifty[, -c(2)], id.vars = "Date")
ggplot(nifty_ts, aes(x = Date, y = value)) + 
    geom_line(color="lightblue") + 
    theme_bw() +
    facet_wrap(~ variable, scales = "free_y", ncol = 1)
**Figure 1.** Daily trend of all stocks, 2000-2020.

Figure 1. Daily trend of all stocks, 2000-2020.

From the trend of all stocks, we can see the time series exhibit non-stationarity. There was a substantial strike to the India stock market after the outbreak of Coronavirus.

Core Example

Python

Set up

The main packages used in this analysis are:

  • pandas: For data cleaning and data frame mutations
  • numpy: For numeric data processing
  • datetime: For date format data processing
  • sklearn: For missing data imputing
  • statsmodels and pmdarima: For the main ARIMA model
  • matplotlib: For making plots

Data cleaning and visualization

Firstly, we import the data.

import sys
sys.path.append('/usr/local/lib/python3.8/site-packages')
# Just to make sure the packages are loaded.
import pandas as pd
import numpy as np
df = pd.read_csv('./NIFTY50_all.csv')

The variable Trades has 50% missing values, we can delete it. We can delete redundant variables and impute %Deliverable via Simple Imputation because there are not so many missing values (about 7%). We also need to merge different symbols for the same stock.

from datetime import datetime # Dealing with date format data
from sklearn.impute import SimpleImputer # Impute data
# Drop redundant variables and variables with too many missing values
df['Date'] = [datetime.strptime(x, '%Y-%m-%d') for x in df['Date']]
df1 = df.drop(['Trades', 'Deliverable Volume', 'Series'], axis=1)
ls1 = ['MUNDRAPORT', 'UTIBANK', 'BAJAUTOFIN', 'BHARTI', 'HEROHONDA',
       'HINDALC0', 'HINDLEVER', 'INFOSYSTCH', 'JSWSTL', 'KOTAKMAH', 'TELCO',
       'TISCO', 'UNIPHOS', 'SESAGOA', 'SSLT', 'ZEETELE']
ls2 = ['ADANIPORTS', 'AXISBANK', 'BAJFINANCE', 'BHARTIARTL', 'HEROMOTOCO',
       'HINDALCO', 'HINDUNILVR', 'INFY', 'JSWSTEEL', 'KOTAKBANK', 'TATAMOTORS',
       'TATASTEEL', 'UPL', 'VEDL', 'VEDL', 'ZEEL']
df1['Symbol'] = df1['Symbol'].replace(ls1, ls2)
df1['Symbol'] = pd.Categorical(df1['Symbol'])
# Impute missing values
df2 = pd.get_dummies(data=df1, drop_first=True)
df2['Date']=df2['Date'].map(datetime.toordinal)
imp = SimpleImputer()
p = imp.fit_transform(df2)
df1['%Deliverble'] = p[:, 10]

We can take a look at the data. Take the stock “ADANIPORTS” as an example.

import matplotlib.pyplot as plt # Plotting package in python
names = df1['Symbol'].cat.categories
example = df1[df1['Symbol'] == names[0]]
fig, ax = plt.subplots(3, 1, figsize=(8, 8))
ax[0].plot(example['Date'], example['VWAP'])
ax[0].set_xticks([])
ax[0].set_xlabel('Days')
ax[0].set_ylabel('Volume weighted average price')
ax[1].plot(example['Date'], example['Volume'])
ax[1].set_xticks([])
ax[1].set_xlabel('Days')
ax[1].set_ylabel('Volume')
ax[2].plot(example['Date'], example['Turnover'])
ax[2].set_xticks([])
ax[2].set_xlabel('Days')
ax[2].set_ylabel('Turnover')
ax[0].set_title('Time series plots of stock %s' % names[0])

Figure 2.1. Time series plots of example stocks

Determine model parameters

We will use the time series VWAP for the analysis below.

The differencing parameter \(d\) of the model can be determined by doing Augmented Dickey-Fuller tests, which can indicate whether the time series are stationary. See the reference for more details about ADF tests. The python packages statsmodels.tsa and pmdarima.arima are very helpful here.

from statsmodels.tsa.arima_model import ARIMA 
from pmdarima.arima.utils import ndiffs # ARIMA model packages
# This function chooses the smallest d for the series to be stationary
names = df1['Symbol'].cat.categories
ls0 = []
for i in names:
    subdf = df1[df1['Symbol'] == i]
    # Select the rows for stock i
    ls0.append(ndiffs(subdf['VWAP'], test='adf'))
ls0  # Most values are 1
## [0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
max(ls0) # We don't need 2-order differencing
## 1

In order to ensure all time series are stationary, we choose \(d=1\) for all stocks.

The AR parameter \(p\) of the model can be determined by looking at Partial Autocorrelation plots. These plots indicate the correlation between the series and its lag. We use the first 4 stocks alphabetically as a sample. Notice the series need to be differenced first (\(d=1\)).

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, ax = plt.subplots(1, 4, figsize=(15, 5))
# Take the first 4 stocks as a sample
for i in range(4):
    subdf = df1[df1['Symbol'] == names[i]]
    # Select the rows for stock number i
    plot_pacf(subdf['VWAP'].diff().dropna(), ax=ax[i])
    ax[i].set_title('PACF plot for stock %s' % names[i])

Figure 2.2. PACF plots for choosing p

Notice lag 1 is at least borderline significant in the plots for all stocks, but lag 2 is not significant. Therefore, we can choose \(p=1\) for all stocks.

The MA parameter \(q\) of the model can be determined by looking at Autocorrelation (ACF) plots. We use the next 4 stocks alphabetically as a sample. Similarly, the series need to be differenced (\(d=1\)).

fig, ax = plt.subplots(1, 4, figsize=(15, 5))
for i in range(4):
    subdf = df1[df1['Symbol'] == names[i+4]]
    plot_acf(subdf['VWAP'].diff().dropna(), ax=ax[i])
    ax[i].set_title('ACF plot for stock %s' % names[i+4])

Figure 2.3. ACF plots for choosing q

Notice lag 1 is significant for most of the stocks but lag 2 is not. Therefore we can choose \(q=1\) for the MA term.

Fit models

According to the process above, we choose the \(ARIMA(1, 1, 1)\) for all stocks. After fitting the models, we can view some of the summaries. Take the first 5 stocks alphabetically as an example.

mlist = [] # Models
flist = [] # Model fits
for i in names:
    subdf = df1[df1['Symbol'] == i]
    m = ARIMA(list(subdf['VWAP']), order=(1, 1, 1))
    mlist.append(m)
    flist.append(m.fit(disp=0))
for i in range(5):
    print(flist[i].summary())
##                              ARIMA Model Results                              
## ==============================================================================
## Dep. Variable:                    D.y   No. Observations:                 3178
## Model:                 ARIMA(1, 1, 1)   Log Likelihood              -13340.537
## Method:                       css-mle   S.D. of innovations             16.100
## Date:                Tue, 17 Nov 2020   AIC                          26689.074
## Time:                        21:08:29   BIC                          26713.330
## Sample:                             1   HQIC                         26697.773
##                                                                               
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const         -0.2048      0.314     -0.652      0.514      -0.820       0.411
## ar.L1.D.y      0.1049      0.151      0.694      0.488      -0.192       0.401
## ma.L1.D.y     -0.0156      0.152     -0.103      0.918      -0.313       0.282
##                                     Roots                                    
## =============================================================================
##                   Real          Imaginary           Modulus         Frequency
## -----------------------------------------------------------------------------
## AR.1            9.5325           +0.0000j            9.5325            0.0000
## MA.1           64.1530           +0.0000j           64.1530            0.0000
## -----------------------------------------------------------------------------
##                              ARIMA Model Results                              
## ==============================================================================
## Dep. Variable:                    D.y   No. Observations:                 5162
## Model:                 ARIMA(1, 1, 1)   Log Likelihood              -29071.318
## Method:                       css-mle   S.D. of innovations             67.549
## Date:                Tue, 17 Nov 2020   AIC                          58150.637
## Time:                        21:08:29   BIC                          58176.833
## Sample:                             1   HQIC                         58159.803
##                                                                               
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          0.3108      0.960      0.324      0.746      -1.570       2.192
## ar.L1.D.y     -0.1743      0.368     -0.473      0.636      -0.896       0.547
## ma.L1.D.y      0.1985      0.366      0.542      0.588      -0.519       0.916
##                                     Roots                                    
## =============================================================================
##                   Real          Imaginary           Modulus         Frequency
## -----------------------------------------------------------------------------
## AR.1           -5.7387           +0.0000j            5.7387            0.5000
## MA.1           -5.0380           +0.0000j            5.0380            0.5000
## -----------------------------------------------------------------------------
##                              ARIMA Model Results                              
## ==============================================================================
## Dep. Variable:                    D.y   No. Observations:                 5162
## Model:                 ARIMA(1, 1, 1)   Log Likelihood              -24320.029
## Method:                       css-mle   S.D. of innovations             26.908
## Date:                Tue, 17 Nov 2020   AIC                          48648.059
## Time:                        21:08:29   BIC                          48674.255
## Sample:                             1   HQIC                         48657.225
##                                                                               
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          0.0766      0.399      0.192      0.848      -0.706       0.859
## ar.L1.D.y     -0.0007      0.237     -0.003      0.998      -0.465       0.464
## ma.L1.D.y      0.0665      0.237      0.281      0.779      -0.397       0.530
##                                     Roots                                    
## =============================================================================
##                   Real          Imaginary           Modulus         Frequency
## -----------------------------------------------------------------------------
## AR.1        -1485.7047           +0.0000j         1485.7047            0.5000
## MA.1          -15.0293           +0.0000j           15.0293            0.5000
## -----------------------------------------------------------------------------
##                              ARIMA Model Results                              
## ==============================================================================
## Dep. Variable:                    D.y   No. Observations:                 3058
## Model:                 ARIMA(1, 1, 1)   Log Likelihood              -15798.297
## Method:                       css-mle   S.D. of innovations             42.406
## Date:                Tue, 17 Nov 2020   AIC                          31604.593
## Time:                        21:08:29   BIC                          31628.695
## Sample:                             1   HQIC                         31613.254
##                                                                               
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          0.7437      0.831      0.895      0.371      -0.885       2.372
## ar.L1.D.y     -0.2225      0.127     -1.758      0.079      -0.470       0.026
## ma.L1.D.y      0.3245      0.122      2.654      0.008       0.085       0.564
##                                     Roots                                    
## =============================================================================
##                   Real          Imaginary           Modulus         Frequency
## -----------------------------------------------------------------------------
## AR.1           -4.4950           +0.0000j            4.4950            0.5000
## MA.1           -3.0812           +0.0000j            3.0812            0.5000
## -----------------------------------------------------------------------------
##                              ARIMA Model Results                              
## ==============================================================================
## Dep. Variable:                    D.y   No. Observations:                 3057
## Model:                 ARIMA(1, 1, 1)   Log Likelihood              -17217.768
## Method:                       css-mle   S.D. of innovations             67.579
## Date:                Tue, 17 Nov 2020   AIC                          34443.535
## Time:                        21:08:29   BIC                          34467.636
## Sample:                             1   HQIC                         34452.196
##                                                                               
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          1.7395      1.497      1.162      0.245      -1.195       4.674
## ar.L1.D.y     -0.1671      0.071     -2.350      0.019      -0.307      -0.028
## ma.L1.D.y      0.4298      0.066      6.537      0.000       0.301       0.559
##                                     Roots                                    
## =============================================================================
##                   Real          Imaginary           Modulus         Frequency
## -----------------------------------------------------------------------------
## AR.1           -5.9832           +0.0000j            5.9832            0.5000
## MA.1           -2.3266           +0.0000j            2.3266            0.5000
## -----------------------------------------------------------------------------

Notice that for the first 3 stocks, the model fit is not good. We can consider removing the MA term since it’s non-significant for some stocks, i.e. choosing the \(ARIMA(1, 1, 0)\) model.

mlist0 = [] # Models
flist0 = [] # Model fits
for i in names:
    subdf = df1[df1['Symbol'] == i]
    m = ARIMA(list(subdf['VWAP']), order=(1, 1, 0))
    mlist0.append(m)
    flist0.append(m.fit(disp=0))

We use AIC has the model choosing criteria. The AIC decreases for the first three stocks, and increases for the 4th and 5th, indicating different stocks need different models.

includema = [] # Whether MA term should be included
for i in range(50):
    includema.append(flist0[i].aic > flist[i].aic)
pd.value_counts(includema)
## True     28
## False    22
## dtype: int64

Model diagnostics

Firstly, we can use the in-sample lagged values to predict the time series.

fig, ax = plt.subplots(10, 5, figsize=(15, 20))
for i in range(50):
    if(includema): # ARIMA(1,1,1)
        flist[i].plot_predict(dynamic=False, ax=ax[i // 5, i % 5])
    else: # ARIMA(1,1,0)
        flist0[i].plot_predict(dynamic=False, ax=ax[i // 5, i % 5])
    ax[i // 5, i % 5].set_title(names[i])
fig.tight_layout()
fig

Figure 2.4. Prediction plots

We can also forecast future VWAPs using the chosen models. For example, we can forecast the average prices in the next 200 trading days after the time series.

fig, ax = plt.subplots(10, 5, figsize=(15, 20))
for i in range(50):
    if(includema):
        forecast, b, ci = flist[i].forecast(200, alpha=0.05)
    else:
        forecast, b, ci = flist0[i].forecast(200, alpha=0.05)
    subdf = df1[df1['Symbol'] == names[i]]
    ax[i // 5, i % 5].plot(list(subdf['VWAP']))
    idx = range(len(subdf['VWAP']), 200+len(subdf['VWAP']))
    ax[i // 5, i % 5].plot(idx, forecast)
    ax[i // 5, i % 5].fill_between(idx, ci[:, 0], ci[:, 1], 
                 alpha=0.15)
    ax[i // 5, i % 5].set_title(names[i])
    ax[i // 5, i % 5].set_xticks([])
fig.tight_layout()
fig

Figure 2.5. Forecast plots

Notice the confidence intervals are very wide, indicating it’s not easy to forecast stock prices.

Model improvement

Now that we chose different models for different stocks, we can further improve the models by choosing the most proper model for each stock.

We can use auto_arima to choose models. It compares different models and chooses the best one. Again, we use ADF test to determine \(d\), and AIC to determine \(p,q\). Take “ADANIPORTS”, “ASIANPAINT” and “BPCL” as examples.

import pmdarima as paim
subdf = df1[df1['Symbol'] == names[0]]
m1 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
                     max_p=3, max_q=3)
m1.summary() # The chosen model was ARIMA(1, 0, 1), which is a good fit.
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                                SARIMAX Results                                
## ==============================================================================
## Dep. Variable:                      y   No. Observations:                 3179
## Model:               SARIMAX(1, 0, 1)   Log Likelihood              -13346.144
## Date:                Tue, 17 Nov 2020   AIC                          26700.287
## Time:                        21:08:50   BIC                          26724.545
## Sample:                             0   HQIC                         26708.987
##                                - 3179                                         
## Covariance Type:                  opg                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## intercept      1.1366      1.144      0.993      0.321      -1.106       3.379
## ar.L1          0.9971      0.002    572.414      0.000       0.994       1.001
## ma.L1          0.0892      0.005     16.587      0.000       0.079       0.100
## sigma2       259.0152      0.418    619.397      0.000     258.196     259.835
## ===================================================================================
## Ljung-Box (Q):                       52.63   Jarque-Bera (JB):         120511061.15
## Prob(Q):                              0.09   Prob(JB):                         0.00
## Heteroskedasticity (H):               0.06   Skew:                           -22.97
## Prob(H) (two-sided):                  0.00   Kurtosis:                       955.73
## ===================================================================================
## 
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """
ls0[0] # Indeed, differencing is not needed for the stock 'ADANIPORTS'.
## 0
subdf = df1[df1['Symbol'] == names[1]]
m2 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
                     max_p=3, max_q=3)
m2.summary() # The chosen model was ARIMA(0, 1, 0), which is a good fit.
# For 'ASIANPAINT', the 1-order difference series is close to a constant.
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                                SARIMAX Results                                
## ==============================================================================
## Dep. Variable:                      y   No. Observations:                 5163
## Model:               SARIMAX(0, 1, 0)   Log Likelihood              -29072.936
## Date:                Tue, 17 Nov 2020   AIC                          58147.873
## Time:                        21:08:51   BIC                          58154.422
## Sample:                             0   HQIC                         58150.164
##                                - 5163                                         
## Covariance Type:                  opg                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## sigma2      4564.8605      1.982   2303.040      0.000    4560.976    4568.745
## ===================================================================================
## Ljung-Box (Q):                       42.98   Jarque-Bera (JB):        3632346684.27
## Prob(Q):                              0.34   Prob(JB):                         0.00
## Heteroskedasticity (H):               4.65   Skew:                           -60.55
## Prob(H) (two-sided):                  0.00   Kurtosis:                      4110.73
## ===================================================================================
## 
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """
subdf = df1[df1['Symbol'] == names[7]]
m3 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
                     max_p=3, max_q=3, error_action='ignore')
m3.summary() # The chosen model was ARIMA(1, 1, 2), which is a good fit.
# For 'BPCL' second order MA term is needed.
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                                SARIMAX Results                                
## ==============================================================================
## Dep. Variable:                      y   No. Observations:                 5163
## Model:               SARIMAX(1, 1, 2)   Log Likelihood              -21047.069
## Date:                Tue, 17 Nov 2020   AIC                          42102.139
## Time:                        21:09:09   BIC                          42128.335
## Sample:                             0   HQIC                         42111.305
##                                - 5163                                         
## Covariance Type:                  opg                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## ar.L1          0.9703      0.019     51.948      0.000       0.934       1.007
## ma.L1         -0.9021      0.020    -45.739      0.000      -0.941      -0.863
## ma.L2         -0.0779      0.008    -10.104      0.000      -0.093      -0.063
## sigma2       203.7130      0.511    398.595      0.000     202.711     204.715
## ===================================================================================
## Ljung-Box (Q):                       31.14   Jarque-Bera (JB):          97447842.12
## Prob(Q):                              0.84   Prob(JB):                         0.00
## Heteroskedasticity (H):               5.01   Skew:                           -18.28
## Prob(H) (two-sided):                  0.00   Kurtosis:                       675.11
## ===================================================================================
## 
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """

We can improve each of the models invidivually for slightly better forecast performance.

R

Autocorrelation Plots

We will focus on the variables: Symbol, VWAP, Volume, Trades and a newly created variable Return, which is the difference between Close and Prev Close. First we choose one stock to analyze its autocorrelation of trend on theabove four variables. The trend plot above shows large value of Volume, sowe can take log to decrease its trend.

library(forecast)
candidate = "ADANIPORTS"
nifty_cand = nifty %>% filter(Symbol == candidate) %>% 
  mutate(Volume = log(Volume))
vars_list = names(nifty_cand)[-c(1, 2)]
fig.cap2 = "**Figure 3.1.** ACF plots of stock: ADANIPORTS."
par(mfrow=c(2,2))
for ( var in vars_list ) {
  df_ts = ts(nifty_cand[[var]], frequency = 1, start = c(2000, 01, 03))
  # acf
  acf_tou <- acf(df_ts, lag.max = 30, plot = FALSE)
  plot(acf_tou, xlab = "Lag (in Year)", main = var)
}
**Figure 3.1.** ACF plots of stock: ADANIPORTS.

Figure 3.1. ACF plots of stock: ADANIPORTS.

Notice that all the variables show high autocorrelation except for Return, which is because Return is calculated from the first difference of closing price working as a linear filter applied to eliminate a trend. Since we are going to apply ARIMA model to the data, which can works for non-stationary time series, we can leave the model to detrend time seires. For the illustration purpose, we will take first difference of other three variables and compare the autocorrelation plot to the previous one.

fig.cap3 = "**Figure 3.2.** ACF plots of first difference of time seires."
acf_diff_plot(df = nifty, sym = candidate)
**Figure 3.2.** ACF plots of first difference of time seires.

Figure 3.2. ACF plots of first difference of time seires.

Fitting an ARIMA Model

We can choose the ARIMA Model by AIC. Let’s tabulate some AIC values for a range of different choices of p and q, assuming d takes 0 for Return while 1 for other 3 variables. Below shows the AIC table of fitting ARIMA on Return time series of stock: “ADANIPORTS”. We will subset the last 120 time series as test data.

Table 2. AIC for different ARIMA parameters
MA0 MA1 MA2 MA3 MA4
AR0 336526.9 336472.5 336467.4 336418.2 336397.1
AR1 336471.1 336472.2 336442.9 336160.5 336146.7
AR2 336470.3 336449.3 336458.8 335870.8 335872.1
AR3 336415.3 336121.2 335870.7 335872.7 335784.7
AR4 336400.3 336095.4 336125.2 335783.9 335522.1

The AIC table suggests that ARIMA(4, 0, 4) is the best model for the return of “ADANIPORTS”. This model may give a sign that increasing p and q will tend to get smaller AIC for a better fit. However, models with higher p and q are more complex, so it may lead to problems like overfitting, numerical stability and etc. We usually prefer a simply model, which also better for interpretation.

Even though it is nice to view the change of AIC value as the change of p and q, for a big dataset like this, it is very inefficient to iterate over range of p and q. auto.arima()in the forest package is much faster in generating the results. We can verify that auto.arima() also suggests ARIMA(4, 0, 0) is the model with the best fit in the range of (1, 4) of p and q.

ts_arima <- auto.arima(head(nifty_cand_ts, -30), max.p = 4, 
                      max.q = 4, max.d = 3)
print(ts_arima)
## Series: head(nifty_cand_ts, -30) 
## ARIMA(4,0,4) with zero mean 
## 
## Coefficients:
##          ar1     ar2     ar3      ar4      ma1      ma2      ma3     ma4
##       0.0755  1.0253  0.0893  -0.7314  -0.0316  -1.0385  -0.2077  0.7288
## s.e.  0.0183  0.0157  0.0150   0.0128   0.0195   0.0160   0.0145  0.0146
## 
## sigma^2 estimated as 5155:  log likelihood=-167751.7
## AIC=335521.5   AICc=335521.5   BIC=335596.1

Forecasting using an ARIMA Model

Lastly, we will forecast the next 120 time series and compare the result with our test set.

ts_forecasts <- forecast(ts_arima, h = 30) 
acc <- accuracy(ts_forecasts, head(tail(nifty_cand_ts, 30), 7))
print(round(acc, 4))
##                   ME    RMSE     MAE      MPE    MAPE   MASE    ACF1
## Training set -0.4800 71.7890 15.4322      NaN     Inf 0.7424 -0.0081
## Test set      3.0744  5.5276  4.5046 -49.0596 206.901 0.2167 -0.3585
##              Theil's U
## Training set        NA
## Test set        0.9729

The RMSE and MAE for the test set are 5.5276 and 4.5046, respectively.

fig.cap4 = "**Figure 3.3.** Forecasts from ARIMA(4, 0, 4) with zero mean."
autoplot(ts_forecasts, main = "") + xlab("Day") + 
  ylab("Return") +
  theme_bw()
**Figure 3.3.** Forecasts from ARIMA(4, 0, 4) with zero mean.

Figure 3.3. Forecasts from ARIMA(4, 0, 4) with zero mean.

Stata

Data cleaning and visualization

Firstly, we import the data and do data cleanning. We drop the variable and %Deliverable. Also, we transform the variable from string type to date type to treat the whole data set as time series data set.

import delimited NIFTY50_all.csv, clear

* Data Cleaning
gen date2 = date(date, "YMD")
format date2 %tdCCYY-nn-dd
drop date series
drop trades deliverablevolume
rename date2 date
label variable date "Date"

* Replace Symbol Names
replace symbol = "ADANIPORTS" if symbol == "MUNDRAPORT"
replace symbol = "AXISBANK" if symbol == "UTIBANK"
replace symbol = "BAJFINANCE" if symbol == "BAJAUTOFIN"
replace symbol = "BHARTIARTL" if symbol == "BHARTI"
replace symbol = "HEROMOTOCO" if symbol == "HEROHONDA"
replace symbol = "HINDALCO" if symbol == "HINDALC0"
replace symbol = "HINDUNILVR" if symbol == "HINDLEVER"
replace symbol = "INFY" if symbol == "INFOSYSTCH"
replace symbol = "JSWSTEEL" if symbol == "JSWSTL"
replace symbol = "KOTAKBANK" if symbol == "KOTAKMAH"
replace symbol = "TATAMOTORS" if symbol == "TELCO"
replace symbol = "TATASTEEL" if symbol == "TISCO"
replace symbol = "UPL" if symbol == "UNIPHOS"
replace symbol = "VEDL" if symbol == "SESAGOA"
replace symbol = "VEDL" if symbol == "SSLT"
replace symbol = "ZEEL" if symbol == "ZEETELE"

* Save the cleaned data
save NIFTY_clean, replace 

Then we visualize the data and the stock “ADANIPORTS” is taken as an example.

use NIFTY_clean, clear

keep if symbol == "ADANIPORTS"

graph twoway line vwap date, color("blue") xtitle("Days") ///
  ytitle("Volume weighted average price")
graph export vwap_data.png, replace
graph twoway line volume date, color("blue") xtitle("Days") ytitle("Volume")
graph export vwap_data.png, replace
graph twoway line turnover date, color("blue") xtitle("Days") ///
  ytitle("Turnover")
graph export vwap_data.png, replace

ada_vwap

ada_vwap ada_vwap

Figure 4.1. Time series plots of ADANIPORTS

Determine model parameters

We will use the time series VWAP for the analysis below.

For all stocks, we do Augmented Dickey-Fuller tests to determine whether the time series are stationary or not.

use NIFTY_clean, clear

foreach sym of local sbls {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    dfuller d1.vwap
}

We do the test on the with the first-order differentiation. All stocks are reporting minimum p-values, hence we decide to use \(d=1\) for all stocks.

Then, in order to find AR parameter \(p\) of the model, we generate the partial autoregressive (PACF) plots together with autoregressive (ACF) plots.

Note: we will only plot the first 8 stocks as an example.

use NIFTY_clean, clear

levelsof(symbol), local(sbls)

local sbls_f8 = "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV ///
  BAJFINANCE BHARTIARTL BPCL"

foreach sym of local sbls_f8 {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    ac D.vwap
    graph export acf_`sym'.png
    pac D.vwap
    graph export pacf_`sym'.png
}

The PACF plots for the first 4 stocks are the following: pacf_ADA pacf_ASI pacf_AXI pacf_BAJ_A

Figure 4.2. ACF and PACF plots of ADANIPORTS

And the ACF plots for the next 4 stocks are the following:

acf_BAJ_F acf_BAJF acf_BHA acf_BPC

Figure 4.3. ACF and PACF plots for BAJ_F, BAJF, BHA, BPC

We can get the similar conclusion that lag 1 is absolutely significant while lag 2 is not, hencewe can choose \(p=1\) for the AR term and \(q=1\) for the MA term for all stocks.

Fit models

According to the process above, we choose the \(ARIMA(1, 1, 1)\) for all stocks. However, diagnostics tells sometimes the \(ARIMA(1, 1, 0)\) performs better for some stocks. Hence, we try to use the better model to fit the data and then plot the predicted values against original values.

Note: we will only plot the first 5 stocks as an example.

use NIFTY_clean, clear

local sbls_f5 = "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV"

foreach sym of local sbls_f5 {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    arima vwap, arima(1,1,1)
    estat ic
    mat l_aim = r(S)
    scalar aic_aim = l_aim[1,5]
    arima vwap, arima(1,1,0)
    estat ic
    mat l_ai = r(S)
    scalar aic_ai = l_aim[1,5]
    if aic_aim > aic_ai {
        arima vwap, arima(1,1,0)
        predict vwap_p
        gen vwap_p = vwap_pd + vwap
        graph twoway line vwap date, lwidth("vthin") color("blue")///
          || line vwap_p date, lwidth("vthin") color("red") lpattern("dash")
        graph export fitted_`sym'.png
    } 
    else {
        arima vwap, arima(1,1,1)
        predict vwap_pd
        gen vwap_p = vwap_pd + vwap
        graph twoway line vwap date, lwidth("vthin") color("blue")///
          || line vwap_p date, lwidth("vthin") color("red") lpattern("dash")
        graph export fitted_`sym'.png
    }
}

The sample fitted graphs are:

fit_ADA fit_ASI fit_AXI fit_BAJ_A fit_BAJ_F

Figure 4.4. Fitted plots for the first 5 stocks

Model improvement

Now that we chose different models for different stocks, we can further improve the models by choosing the most proper model for each stock.

However, Stata does not have some similar funciton as auto_arima to choose models automatically. Hence, we may related to other two languages ( Python, R). Heavy and tedious computation is expected in Stata here.

To-do list

  • Prune codes, work on warnings
  • Add writings
  • Wrap up the whole project, finish conclusion

References

  1. A modern Time Series tutorial: Link

  2. ARIMA model in Wikipedia: Link

  3. ADF test in Wikipedia: Link